Add ControllerStartupLatency metric for SandboxClaims #522
k8s-ci-robot merged 3 commits into kubernetes-sigs:main
}
claim.Annotations[asmetrics.TraceContextAnnotation] = tc
if needObsPatch {
	claim.Annotations[obsAnnotation] = time.Now().Format(time.RFC3339Nano)
You might consider just keeping an in-memory map, but ... given we're already writing to the apiserver for the trace... sgtm
I ran the latency test using a sync.Map to keep track of the times instead of this patch update, adding the values to the map with LoadOrStore at the beginning of the Reconcile loop and deleting them when the metric is recorded to keep the map from growing indefinitely. It actually ran slower than sending a patch update to the apiserver. Granted, I ran the test both times with tracing turned on, so I can test again without tracing.
Thanks Ivy. I think we also need a version of the original metric that optionally uses a client-provided timestamp (in a pre-defined annotation). Or we can skip emitting the claim_latency_metric if that annotation is not set.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: aditya-shantanu, igooch
This PR introduces a new metric, agent_sandbox_claim_controller_startup_latency_ms, to provide higher precision tracking of SandboxClaim startup performance.

Problem

Currently, startup latency is measured using the standard Kubernetes creationTimestamp. However, this timestamp has one-second granularity. For fast-provisioning resources like SandboxClaims, where target latencies are often in the millisecond range, this granularity is too coarse and leads to inaccurate P50/P90 metrics.

Proposed Solution

The controller now stamps a high-precision controller-first-observed-at annotation during its first reconciliation cycle. The new metric measures the duration from this observation point to the "Ready" state.

Notes for the reviewer